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Abstract 



ZKCM is a C++ library developed for the purpose of multiprecision matrix computation, on the basis of the GNU MP 
and MPFR libraries. It provides an easy-to-use syntax and convenient functions for matrix manipulations including those 
often used in numerical simulations in quantum physics. Its extension library, ZKCM_QC, is developed for simulating 
quantum computing using the time-dependent matrix-product-state simulation method. This paper gives an introduction 
^ about the libraries with practical sample programs. 
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PROGRAM SUMMARY 

Manuscript Title: ZKCM: a CH — h library for multiprecision 
matrix computation with applications in quantum information 
Authors: Akira SaiToh 
'Program Title: ZKCM 
Journal Reference: 
Catalogue identifier: 

Licensing provisions: GNU Lesser General Public License Ver. 
3 

Programming language: C++ 
Computer: General computers 

Operating system: Unix-like systems, such as Linux, Free BSD, 
Cygwin on Windows OS, etc. 

RAM: Several mega bytes - several giga bytes, dependent on 
the problem instance 

Keywords: Multiprecision computing, Linear algebra, Time- 
dependent matrix product state, Quantum information 
Classification: 4.8 Linear Equations and Matrices, 4.15 
Quantum Computing 

■External routines/libraries: GNU MP (GMP) Q], MPFR @] 
Ver. 3.0.0 or later 

Nature of problem: Multiprecision computation is helpful to 
guarantee and/or evaluate the accuracy of simulation results 
in numerical physics. There is a potential demand for a 
programming library focusing on matrix computation usable 
for this purpose with a user-friendly syntax. 
Solution method: A CH — h library ZKCM has been developed 
for multiprecision matrix computation. It provides matrix 
operations useful for numerical studies of physics, e.g., the 
tensor product (Kronecker product), the tracing-out operation, 
the inner product, the LU decomposition, the Hermitian- 
matrix diagonalization, the singular-value decomposition, 
and the discrete Fourier transform. For basic floating-point 
operations, GMP and MPFR libraries are used. An extension 
library ZKCM_QC has also been developed, which employs 
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the time-dependent matrix-product-state method to simulate 
quantum computing. 

Restrictions: Multiprecision computation with more than a 
half thousand bit precision is often a thousand times slower 
than double-precision computation for any kind of matrix 
computation. 

Additional comments: A user's manual is placed in the 
directory "doc" of the package. Each function is explained in 
a reference manual found in the directories "doc/html" and 
"doc/latex". Sample programs are placed in the directory 
"samples" . 

Running time: It takes less than thirty seconds to obtain a 
DFT spectrum for 2 16 data points of a time evolution of a 
quantum system described by a 4 x 4 matrix Hamiltonian 
for 256-bit precision when we use recent AMD or Intel CPU 
with 2.5 GHz or more CPU frequency. It takes three to five 
minutes to diagonalize a 100 x 100 Hermitian matrix for 
512-bit precision using the aforementioned CPU. 
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1. Introduction 

Precision of floating-point operations is sometimes of 
serious concern in simulation physics when rounding er- 
rors in variables or matrix elements considerably af- 
fect numerical results for investigated physical phenom- 
ena [l|. There are several programming libraries, e.g., 
Refs. 11 SI, 1,0,0], for high-precision computing, which 
are helpful in this regard. Among them, the library named 
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ZKCM 0], which we have been developing, is a C++ li- 
brary for multiprecision complex-number matrix computa- 
tion. It provides several functionalities including the LU 
decomposition, the singular value decomposition, the ten- 
sor product, and the tracing-out operation and an easy- 
to-use syntax for basic operations. It is based on the GNU 
MP (GMP) [I] and MPFR 9] libraries, which are com- 
monly included in recent distributions of UNIX-like sys- 
tems. 

There is an extension library named ZKCM_QC. This 
library is designed for simulating quantum computing [lol . 
1 1 111 by the time-dependent matrix-product-state method 
[12] [or, simply referred to as the matrix-product-state 
(MPS) method]. It uses a matrix product state 13, 14, 12] 
to represent a pure quantum state. The MPS method is re- 
cently one of the standard methods for simulation-physics 
software fisj . As for other methods effective for simulat- 
ing quantum computing, see, e.g., Refs. 0, [HI. With 
ZKCM_QC, one may use quantum gates in U(2), U(4), 
and U(8) as elementary gates. Indeed, in general, quan- 
tum gates in U(2) and U(4) are enough for universal quan- 
tum computing [18(, but we regard quantum gates in U(8) 
also as elementary gates so as to reduce computational 
overheads in circuit constructions. 

A simulation of quantum computing with MPS is known 
for its computational efficiency in case the Schmidt ranks 
are kept small during the simulation 12, Il9j . Even for 
the case slightly large Schmidt ranks are involved, it is not 
as expensive as a simple simulation. The theory of MPS 
simulation will be briefly explained in Sec. 14.11 Numeri- 
cal errors will be phenomenologically discussed in Sec. 14.31 
which will give a certain reason why we introduce multi- 
precision computation for an MPS simulation of quantum 
computing. 

This contribution is intended to provide a useful intro- 
duction for programming with the libraries. Section [2] de- 
scribes two examples of the use of the ZKCM library: one 
for showing the precision dependence of a solution of a 
simple linear equation; the other for simulating an NMR 
spectrum in a simple model. Performance evaluation of 
the library is made in Sec. [31 Section 0] shows an overview 
of the theory of the MPS method and an example of sim- 
ulating a simple quantum circuit using the ZKCM_QC li- 
brary. In addition, later in the section numerical errors in 
an MPS simulation of quantum computing are examined 
using a certain setup of a quantum algorithm. We discuss 
on the effectiveness and the performance of our libraries 
in Sec.[5j Section [5] summarizes our achievements. 



2. ZKCM Library 

The ZKCM library is designed for general-purpose ma- 
trix computation. This section concentrates on its main 
library. It has two major CH — h classes: "zkcm_class" 
and "zkcm_matrix" . The former class is a class of a com- 
plex number. Many operators like "+=" and functions like 



trigonometric functions are defined for the class. The lat- 
ter class is a class of a matrix. Standard operations and 
functions like matrix inversion are defined. In addition, the 
singular-value decomposition of a general matrix, the di- 
agonalization of an Hcrmitian matrix, the discrete Fourier 
transform, etc., are defined for the class. Functoins for the 
tensor product (Kronecker product) and the tracing-out 
operation (e.g., one can trace out the subsystem B of sys- 
tem ABC) are also defined. A detailed document is placed 
in the "doc" directory of the package of ZKCM. 

As for installation of the library, the standard process 
"./configure — > make — > sudo make install" works in 
most cases; otherwise the document should be consulted. 

We will next look at simple examples to demonstrate 
the programming style using the library. (Here, the latest 
stable version, ver. 0.3.2, is used.) 

2.1. Program examples 

Example 1. There is a classical example [2pj ] often men- 
tioned to recall the importance of multiprecision compu- 
tation. It clarifies that a sufficiently large precision is re- 
quired even for a simple linear equation with only two 
variables. 

Consider the linear equation 
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with 
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/ 64919121 
^41869520.5 



459018721 
402558961 



The exact solution is x = 205117922, y = 83739041. 

One way to numerically find the solution is to com- 
pute (x,yY = j4 _1 (1,0)* using the matrix inversion. An- 
other way is to utilize the Gaussian elimination (see, e.g., 
Refs. [IISl]). We used functions "inv" and "gauss" of 
ZKCM for the former and the latter ways, respectively (as 
for pivoting, a partial pivoting is used in "gauss"). Then, 
we plotted the computed value of x against the precision 
[bits] as shown in Fig. [1] (Here, we refer to the number of 
bits for a significand of a floating-point number as preci- 
sionQ) We can see that low-precision computing resulted 
in a wrong answer while high-precision computing resulted 
in a correct answer. It should be emphasized that the dou- 
ble precision (namely, 53 bits for the mantissa portion of 
a floating-point number) is not enough in this example. 



1 Precision is defined as an exact length of the mantissa portion 
of a floating-point number in MPFR while it is defined as a least 
length in GMP S§. In ZKCM, a complex number keeps each part 
internally as an MPFR variable so that precision is an exact length. 
However, it should be noted that an output from a function usually 
carries over the best precision among those of involved variables in 
ZKCM. 
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Figure 1: Plot of the computed values of x against the precision (the 
bit length of a mantissa) employed for floating-point numbers. Ei- 
ther (i) the matrix inversion (pluses) or (ii) the Gaussian elimination 
(circles) was used. For case (ii), there are data points with x values 
< —1.0 X 10 s for some values of precision < 50; they are outside the 
range in the figure. 



The program used for this example is shown in Listing[T] 
It is found as "samples/classic.example.cpp" in the pack- 
age. It is written to use the matrix inversion. To use the 
Gaussian elimination, the line "B = inv(A) * B;" should 
be replaced with "B = gauss (A, B);". 



#include " zkcm . hpp " 
#include <f stream > 
int main ( ) 
{ 

std: : of stream ofs ("classic.ex ample .dat ") ; 

if (!ofs) 

{ 

std : : cerr << "Could not create the file\ 
classic .example . dat " 

<< std : : endl ; 

return - 1 ; 

} 

ofs << "# prec x y" << std :: endl ; 

for (int prec = 32; prec < 257; prec++) 

I 

zkcm_set_default_prec (prec) ; 
zkcm_matrix A(2, 2), B(2, 1); 



A(0, 0) = "64919121"; 

A(0, 1) = " -159018721" 

A(l, 0) = "41869520.5" 

A(l, 1) = " -102558961" 

B(0, 0) = 1; 

B (1 , 0) = ; 



B = inv(A) * B; 

//Replace the above line with 

//B = gauss (A , B) ; 

//to use the Gaussian elimination instead. 

ofs << prec << " " 

<< B(0, 0) << " " 

<< B(l, 0) << std :: endl ; 

} 

zkcm_quit ( ) ; 
return 0; 



Listing 1: classic_example.cpp 

The program is compiled and executed in a standard wayjfl 
Its output file "classic_example.dat" can be visualized by 
Gnuplot 23] using the file "classic_example.gnuplot" found 
in the "samples" directory. 

In the program, one can see typical fea- 
tures of the ZKCM library. The line 
"zkcm_set_def ault_prec (prec) ; " sets the default 
internal precision (the default bit length of a significand) 
of an object to "prec". Any object like "A" (this is a 
2x2 matrix) constructed without specified precision 
possesses the default precision. It is possible to specify a 
particular precision (i.e., in case of a matrix object, the 
length of the significand for each part of each element 
of the matrix), say, 512, by constructing the object 
as "zkcm_matrix A (2, 2, 512)", for instance. For 
a matrix object, say, "A", it is convenient to access 
its element by "A(i,j)" that is a reference to 
the element. The element is an object in the type of 
"zkcm_class" . To assign a value into a "zkcm_class" 
object (a complex number) "z" , one can write "z= . . . ; " 
where the right-hand side can be a number or a string 

describing a complex number in the style " + *I" 

(PARI/GP style 01; here, "I" stands for i = V^l), 

(here, "j" stands for i), etc. 
(there are other acceptable styles). In the program, by 
"A(0, 0) = "64919121";" the value 64919121 + Oi is 
assigned to the (0,0) element of matrix A. Other values 
are assigned to corresponding elements in a similar way. 
The solution of the linear equation is obtained by using 
the function "inv" or "gauss". The obtained result 
is written into the output file stream "ofs" by using 
the operator "<<". In the program, the elements are 
individually written out so as to meet the data format of 
a data-plotting program. After computation is performed, 
the program should be terminated. At this stage, it is 
recommended to write "zkcm_quit ( ) ; " so as to release 
miscellaneous memories allocated for internal use of 
background library functions. 



Example 2. As the second example, a sample program 
"NMR_spectrum_simulation.cpp" found in the "samples" 
directory of the package of ZKCM is explained. This 
program generates a simulated free- induction-decay (FID) 
spectrum of liquid-state NMR for the spin system consist- 
ing of a proton spin with precession frequency w\ — 400 
MHz (variable "wl" in the program) and a 13 C spin with 
precession frequency W2 — 125 MHz (variable "w2") at 
room temperature (300 K) (variable "T"). A J coupling 
constant J12 = 140 kHz (variable "J12") is considered for 
the spins. 



2 To make an executable file, the library flags typically "-lzkcm -lm 
-lmpfr -lgmp -lgmpxx" are required. As for ZKCM_QC, additionally 
"-lzkcm_qc" should be specified. 
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The first line of the program is to include a header hie 
of ZKCM: 

#include "zkcm.hpp" 

int main(int argc, char *argv[]) 

{ 

In the beginning of the "main" function, the default pre- 
cision is set to 280 bits by 

zkcm_set_def ault_prec (280) ; 

In the subsequent lines, some matrices like Pauli matrices 
I and X, etc. are generated. For example, X is generated 
as 

zkcm_matrix X = "[0, 1; 1, 0]"; 

using a string representing a matrix in the PARI/GP style. 
The Y90 pulse is generated as 

zkcm_matrix Yhpi(2,2); 

Yhpi(0,0) = sqrt(zkcm_class(0.5)) ; 

Yhpi(0,l) = sqrt (zkcm_class (0 . 5) ) ; 

Yhpi(l,0) = -sqrt (zkcm_class (0 . 5) ) ; 

Yhpi(l,l) = sqrt (zkcm_class (0 . 5) ) ; 

Other matrices are generated by similar lines. After this, 
values of constants and parameters are set. For example, 
the Boltzmann constant fee [J/K] is generated as 

zkcm_class kB("l . 3806504e-23") ; 

The Hamiltonian H of the spin system is generated in the 
type of "zkcm_matrix" and is specified as 

H = wl * tensorprod(Z/2,I) + w2 * tensorprod(I ,Z/2) 
+ J12 * tensorprod(Z/2,Z/2) ; 

This is used to generate a thermal state p: 

zkcm_matrix rho(4,4); 

rho = exp_H((-hplanck/kB/T) * H) ; 

rho /= trace(rho); 

Here, "exp_H" is a function to calculate the exponential of 
an Hermitian matrix and "hplanck" is the Planck constant 
(6.62606896 x 10" 34 Js). The sampling time interval dt 
to record the value of < X > for the proton spin is set 
to 0.43/u>i (any number smaller than 1/2 might be hne 
instead of 0.43) by 

zkcm_class dt (zkcm_class ("0 . 43" ) /wl) ; 

The number of data to record is then decided as 

int N = UNP2(8.0/dt/J12) ; 

Here, function "UNP2" returns the integer upper nearest 
power of 2 for a given number. Now arrays to store data 
are prepared as row vectors. 

zkcm_matrix array(l, N) , array2(l, N) ; 

The following lines prepare the X and Ygo-pulse operators 
acting only on the proton spin. 



zkcm_matrix Xl(4, 4), Yhpil(4, 4); 
XI = tensorprod(X, I); 
Yhpil = tensorprod(Yhpi , I); 

To get an FID data, we firstly tilt the proton spin by the 
ideal Y90 pulse. 

rho = Yhpil * rho * adjoint (Yhpil) ; 

Now the data of time evolution of < X > of the proton 
spin under the Hamiltonian H is recorded for the time 
duration N x dt using 

array = rec_evol (rho , H, XI, dt, N) ; 

We now use a zero-padding for this "array" so as to en- 
hance the resolution. This will extend the array by N 
zeros. 

array2 = zero_padding (array, 2*N) ; 

To obtain a spectrum, the discrete Fourier transform is 
applied. 

array2 = abs (DFT(array2) ) ; 

This "array2" is output to the file "example_zp.fid" as 
an FID spectrum data with df — 1/(2 x N x dt) as the 
frequency interval, in the Gnuplot style by 

GP_lD_print(array2, 1 . 0/dt/zkcm_class (2*N) , 
1, "example_zp.f id") ; 

At last, the function "main" ends with "zkcm_quit () ; " 
and "return ; " . The program is compiled and executed 
in a standard way. 

The result stored in "examplc_zp.fid" is visualized by 
Gnuplot using the file "example_zp. gnuplot" placed in the 
directory "samples" , as shown in Fig. [5J 
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Figure 2: Plot of the simulation data stored in "example_zp.fid" . See 
the text for the details of this simulation. 

It should be noted that we have not employed 
a high-temperature approximation. Under a high- 
tcmpcraturc approximation, the first order deviation —f3H 
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of exp(— j3H) (here, j3 = h/{k&T)) is considered as a devi- 
ation density matrix and calculations are performed using 
the normalized deviation density matrix — TJ/const. This 
approximation is commonly used [28] but it cannot be used 
for simulations for low temperature. An advantage of us- 
ing ZKCM for simulating NMR spectra is that the tem- 
perature can be chosen. This is possible because of high 
accuracy in computing the exponential of a Hamiltonian. 

As we have seen in this example, ZKCM has useful func- 
tions to handle matrices, especially those often used in 
simulations in quantum physics. For the list of available 
functions, see the reference manual found in the package. 

3. Performance evaluation 

We evaluate the performance of ZKCM (version 0.3.2) 
in comparison with PARI [24| (version 2.5.3 with the 
CMP kernel), a conventional highly-evaluated C library 
for mathematical computing. 

3.1. Speed of computing the T function 

Here, we evaluate the performance of the function 
"gamma" of ZKCM, which calculates T(z) for a complex 
number z. It is implemented on the basis of the expansion 
using the Stirling formula for Re z > 0: 



T(z) 



~ z z z 1 v / 27rzexp 



.71=1 



\B2n\z 1 



2n(2n - 1) 



Rl 



where B m is the first or the second Bernoulli number (as 
the absolute value is used, either is fine); L is a sufficiently 
large (but not very large) integer to guarantee the accu- 
racy; Rl is the remainder of the sum (corresponding to the 
sum of the terms for n = L + 1, . . . , oo). It is known that 
the expansion is not convergent since \B2 n \ grows rapidly 
for large n. It was, however, proved by Spira [25j that 
\Rl\ < \§^\z\ 1 - 2L holds when Rez > 0. Thus the ex- 
pansion is usable for computation with enough accuracy 
as long as z is appropriately scaled to a large value in 
advance. We internally scale z to satisfy Rez > and 
\z\ > 10 3 ~ 2 10 . In accordance with this scaling, L is ap- 
propriately chosen so that Rl becomes small enough for 
required precision. For the scaling, the well-known formu- 
lae Y(z) = T(z + k)/[z{z + 1) • • • (z + k - 1)] (here, k is 
an integer) and T{z) = — n/[— zT(— z) sin(— ttz)] (here, z is 
not a pole) are utilized. To compute Bernoulli numbers, 
we utilize the relation 



B m — 1 — 2^ 



B k 



n-k + 1 



and the fact that -B2.J+1 vanishes for integer j > 1. We keep 
the computed values of binomial coefficients and Bernoulli 
numbers inside a certain subblock of a function for com- 
puting T(z). Therefore, the cost to compute B m assum- 
ing that we have already computed Bq, . . . , i? m _2 for even 



number m is small: 0(m) rational-number operations are 
enough^ (we make use of the C++ class "mpq_class" of 
CMP for internally handling rational numbers). 

Now, the speed of computing T(z) using "gamma" is eval- 
uated for z = a + bi with random a,b € [—10,10]. It is 
compared with the function with the same name, "gamma" , 
of the PARI library. The PARI library uses a similar al- 
gorithm to compute T(z). The main difference is that 
PARI can utilize cached values of Bernoulli numbers that 
are once calculated in any function during a process is 
running. It can also use precomputed values of typical 
constants. 

As shown in Table [TJ the average computation time of 
r(z) in the ZKCM library is on the same order of magni- 
tude as that in the PARI library in the absence of cached 
values of Bernoulli numbers. However, in the presence 
of cached values of Bernoulli numbers, the computation 
speed of T(z) in PARI is quite much faster. 

3.2. Speed of matrix multiplication 

Here, we evaluate the speed of matrix multiplication. 
We first generate a 100 x 100 random Hermitian matrix A 
satisfying vTrAAt = 1 . We then continue to compute the 
operations (i) A <— A 2 and (ii) A ■<— A/VTrAA^ sequen- 
tially, i.e., (i) (ii) (i) (ii) ... from left to right. Operations 
are performed under a specified precision "prec" . We com- 
pare time consumption to perform ten sets of "(i) (ii)". 
Time spent for the matrix preparation is not included. As 
shown in Table [3J computation time is on the same or- 
der of magnitude for ZKCM and PARI and discrepancy 
in time consumption is not large. This result is plausible 
since both the libraries employ a straight-forward way to 
implement matrix multiplication and rely on the GMP li- 
brary for the speed of primitive operations. (ZKCM uses 
MPFR for basic operations that is based on GMP; hence 
ZKCM indirectly uses primitive operations of GMP.) 

3. 3. Speed of Hermitian-matrix diagonalization 

We next evaluate the speed of diagonalization of an 
Hermitian matrix for finding all the eigenvectors. The 



3 For other use than computing I'(z), we cannot use this as- 
sumption in general. In this case, Akiyama-Tanigawa algorithm 26] 
is employed to calculate the Bernoulli number. This takes 0(m' 2 ) 
rational-number operations. 

4 This was because the improvement in running time due to set- 
ting compiler optimization flags was negligible. (Of course, there 
were other compiler flags for library specifications etc. which are 
not of our concern in this context.) For instance, a program to find 
eigenvectors of a normalized random 50 X 50 Hermitian matrix using 
the function "diag_H" of ZKCM ("eigen" of PARI) with 768-bit pre- 
cision consumed 34.558 (14.550) seconds on average when compiled 
without optimization flags, and consumed 34.314 (14.551) seconds 
on average when compiled with the optimization flags "-03 -fno- 
strict-aliasing -fomit-frame-pointer" . (Here, we used the Frobenius 
norm for normalization.) The average was taken over 100 trials. The 
standard deviations were 0.0648 (0.143) and 0.0645 (0.122), respec- 
tively. The compiler was GCC ver. 4.6.3 installed by default on the 
operating system Fedora 15 64-bit. A computer with the Intel Core 
i5 M460 2.53 GHz CPU (maximum clock frequency 2.80GHz) and 
4GB RAM was used. 
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Table 1 : Comparison of the average real time consumption for com- 
puting r(z) for z = a + bi with random a, b E [—10, 10]. The average 
was taken over 1,000 trials. No I/O operation was involved. Time 
for generating z was negligible (this was no more than 2.0 X 10 — 5 
seconds). The standard deviation (we employ the sample standard 
deviation) for each entry is shown in parentheses in small fonts un- 
der the entry, "prec" stands for the precision (the bit length of 
each mantissa) employed in a program. In a program using PARI, 
function "nbits2prec" was used to convert the precision into the 
number of code words that is defined as the precision for variables in 
PARI. ZKCM version 0.3.2 and PARI version 2.5.3 were used. They 
were compiled with GCC/GH — h with their default optimization flags, 
while test programs for this performance evaluation were compiled 
without optimization flags. 4 In this table, the middle and the right 
columns show the results for PARI with and without precomputcd 
values of Bernoulli numbers, respectively. Environments: For up- 
per entries (♦): Fedora 15 64- bit operating system on Intel Core i5 
M460 CPU (2.53 GHz, Max. 2.80GHz) with 4GB RAM. For lower 
entries (Jk): Fedora 16 64-bit operating system on AMD FX-8120 
CPU (3.10GHz, Max. 4.00GHz) with 16GB RAM. 



prec 


ZKCM [sec] 


PARI with- 
out cache 
[sec] 


PARI with 

cache [sec] 


256 


♦3.73 x 10" 3 

(8.72 x 10 -5 ) 

*2.95 x 10~ 3 

(9.58 X 


1.38 x 10~ 3 

(7.91 x 10~ 5 ) 

1.42 x 10~ 3 

(2.73 x 10~ 4 ) 


1.63 x 10~ 4 

(6.10 x 10~ 5 ) 

1.48 x 10~ 4 

(9.56 x 10 -5 ) 


512 


9.04 x 10" 3 

(2.95 x 10~ 4 ) 

7.15 x 10~ 3 

(1.64 x 10~ 4 ) 


4.70 x 10" 3 

(6.79 x 10~ 4 ) 

3.12 x 10" 3 

(6.78 x 10~ 4 ) 


3.43 x 10~ 4 

(1.49 x 10~ 4 ) 

3.43 x 10~ 4 

(1.75 x 10~ 4 ) 


768 


1.86 x 10~ 2 

(2.62 x 1CT 4 ) 

1.45 x 10~ 2 

(2.80 x 10~ 4 ) 


9.58 x 10~ 3 

(1.82 x 10~ 3 ) 

8.38 x 10~ 3 

(1.85 x 10~ 3 ) 


6.61 x 10~ 4 

(3.67 x 10 -4 ) 

6.02 x 10~ 4 

(2.07 x 10 -4 ) 


1024 


3.48 x 10~ 2 

(8.77 x 10~ 4 ) 

2.59 x 10~ 2 

(3.67 x 10~ 4 ) 


1.48 x 10" 2 

(8.32 x 10~ 4 ) 

1.34 x 10~ 2 

(6.35 x 10~ 4 ) 


1.25 x 10" 3 

(3.39 x 10~ 4 ) 

1.03 x 10~ 3 

(3.64 x 10~ 4 ) 


1280 


5.37 x 10~ 2 

(1.41 x 10~ 3 ) 

3.92 x 10~ 2 

(9.90 x 10~ 4 ) 


2.44 x 10" 2 

(9.68 x 10~ 4 ) 

2.09 x 10~ 2 

(1.56 x 10~ 3 ) 


1.95 x 10" 3 

(6.87 x 10~ 4 ) 

1.70 x 10~ 3 

(7.23 x 10~ 4 ) 



Table 2: Comparison of the average real time consumption to per- 
form ten sets of operations (i) and (ii) (see the text). The average 
was taken over ten different A's (again, see the text). The standard 
deviation is shown in parentheses in small fonts, "prec" stands for 
the precision (the bit length of a significand) . Environments: Fedora 
15 OS on Intel Core i5 M460 CPU with 4GB RAM for upper entries 
(♦) and Fedora 16 OS on AMD FX-8120 CPU with 16GB RAM for 
lower entries (A). 



prec 


ZKCM [sec] 


PARI [sec] 


256 


♦26.1 (o.ii5) 

*21.4 (0.248) 


15.0 (0.204) 
13.0 (0.567) 


512 


35.5 (0.206) 
27.3 (0.202) 


21.9 (0.0208) 
18.7 (0.188) 


768 


46.4 (o.i6i) 

36.9 (0.266) 


30.8 (0.0702) 
28.1 (0.247) 


1024 


65.8 (o.ii5) 
53.2 (o.i78) 


50.5 (0.199) 
41.3 (0.434) 


1280 


86.5 (0.0265) 

69.3 (o.isi) 


67.8 (0.118) 

54.9 (o.7i5) 



functions used for this task are "diag_H" of ZKCM and 
"eigen" and "jacobi" of PARI. 

Let us briefly explain our design of "diag_H". In the 
function, the routine to diagonalize an Hermitian matrix 
A employs a standard QR method (see, e.g., Refs. 31, 32| ) 
to find eigenvalues, for which Householder reflections and 
Wilkinson shifts are utilized. The eigenvalues Xi are sorted 
from larger to smaller. Then each corresponding eigenvec- 
tor \vi) is calculated by the inverse iteration. During this 
process, we set 

A <— A + const, x \vi-i){vi-i\ 

before calculating \vi). This resolves degeneracy (in case 
we have), so that calculated eigenvector \vi) becomes or- 
thogonal to Iwo), \ v i-i) w ith high accuracy. We still 
test orthogonality and, in case required, perform an or- 
thogonalization of |ui)'s. Enhancement in accuracy of Xi 
is achieved during the inverse iteration steps. The steps 
are retried, often with an initial vector set to the vector 
found in the previous round of steps, and sometimes with a 
randomly chosen initial vector, until sufficient convergence 
of \vi) and Ai for required precision is reached. 

As for the functions of PARI, "eigen" firstly computes 
the roots of the characteristic polynomial and secondly 
uses the Gaussian elimination to find eigenvectors; in con- 
trast, "jacobi" simply uses the Jacobi method. There 
are known facts on PARI library functions [29(: func- 
tion "eigen" uses a naive algorithm so that it fails to 
compute eigenvectors for matrices with degenerate eigen- 
values; function "jacobi" handles real symmetric matri- 
ces only, so that an Hermitian matrix A should be re- 

, , . . . . , . fRe(A) -lm(A)\ 

formed into a symmetric matrix 1 1 



see 



\lm(A) Re (A) ) ^ 
Ch. 11.5 of Ref. [30| for details about this reformation) as 
a workaround to find eigenvalues and eigenvectors using 
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the function. 

We compare the average time consumption to find all 
the eigenvectors of a random 100 x 100 Hermitian matrix 
A with a unit Frobenius norm for several different internal 
precisions. To generate A, matrix elements = a* t are 
simply randomly generated and then the normalization of 
the matrix is performed. The eigenvectors found by a di- 
agonalization are the column vectors of unitary matrix U 
such that U~ 1 AU — diag. 

The probability for a random matrix to be degenerate 
is very small. Tested matrices were in fact nondegenerate 
and "eigen" worked fine except for the precision 256 [bits] 
for which it stopped due to a low-accuracy error. As for 
"jacobi", we used the workaround as mentioned above 
so that the matrix actually input into the function was a 
200 x 200 real symmetric matrix. In this case, we had to 
double the precision of the matrix in order to keep the off- 
diagonal elements of D = U~ 1 AU sufficiently small (i.e., 
\Dij t ijtj\ < 2 - ( prcc-10 )) for the specified precision prec. 
This was not a theoretical consequence but an empirical 
workaround for the use of " j acobi" . In fact, with precision 
512 (1024) [bits], "jacobi" constantly achieved |-D»j,»yy| ~ 
1.0 x 10~ 78 - 1.0 x 2~ 256 (\Dij,i& | - 1.0 x 10~ 155 - 1.0 x 
2 -512 ). Thus, it was reasonable that doubled precision was 
handed over to "jacobi" when it was called. 

As shown in Table [31 time consumption for "diag_H" 
of ZKCM is on the same order of magnitude as those for 
the functions "eigen" and "jacobi" of PARI, as far as we 
tested for the precision between 256 and 1280 [bits]. 

In addition, we also compare the average time consump- 
tion to find all the eigenvectors of a random N x N Hermi- 
tian matrix with a unit Frobenius norm for several different 
values of N with fixed precision 768 [bits] . We doubled the 
precision for "jacobi" due to the above-mentioned reason. 
As shown in Table [H time consumption of "diag_H" is on 
the same order of magnitude as those for "eigen" and 
"jacobi" as far as we tested for 25 < N < 125. 

The evaluated functions are all designed to find eigen- 
vectors in addition to eigenvalues. One may think of the 
case only eigenvalues are needed, for which shorter com- 
putation time is expected. The PARI library does not 
have a function for this purpose. The ZKCM library has 
the function "eigenvalues_H" which is created by simply 
omitting the inverse iterations from "diag_H" . The prob- 
lem is that the precision of computed eigenvalues cannot 
be guaranteed without corresponding eigenvectors. The 
achievable precision in the absence of inverse iterations is 
evaluated as below together with the time consumption. 

Numerical error in the absence of inverse iterations. The 
performance of "eigenvalues_H" is evaluated in Tables 
[5] and El in comparison with "diag_H". The real time 
consumption to find all the eigenvalues of a normalized 
randomly-generated N x N Hermitian matrix and the nu- 
merical error in the computed eigenvalues are used for this 
comparison. Here, the matrix was generated by a random 



Table 3: Comparison of the average real time consumption to find all 
the eigenvectors of a normalized random 100 X 100 Hermitian matrix. 
The average was taken over ten different matrices. The standard 
deviation is shown in parentheses in small fonts, "prec" stands for 
the precision. Environments: Fedora 15 OS on Intel Core i5 M460 
CPU with 4GB RAM for upper entries (♦) and Fedora 16 OS on 
AMD FX-8120 CPU with 16GB RAM for lower entries (#). Note: 
For precision 256 [bits], function "eigen" of PARI stopped with an 
error and output no result. * Precision was doubled for "jacobi" (see 
the text). 



prec* 


ZKCM, 


PARI, 


PARI, 




diag_H [sec] 


eigen [sec] 


jacobi [sec] 


256 


♦175 (o.io5) 

*137 (0.769) 


N/A (N/A) 
N/A (N/A) 


103 (0.324) 
91.2 (0.372) 


512 


259 (o.i60) 


171 (1.27) 


265 (0.974) 


203 (i.4i) 


145 (0.426) 


202 (0.502) 


768 


413 (0.378) 


237 (i.24) 


477 (2.io) 


330 (4.14) 


206 (0.764) 


367 (5.43) 


1024 


632 (0.358) 


378 (i.58) 


726 (2.24) 


501 (4.14) 


309 (i.i4) 


506 (3.65) 


1280 


903 (0.315) 


503 (i.2i) 


1020 (5.47) 


704 (3.53) 


404 (i.i3) 


734 (7.98) 



Table 4: Comparison of the average real time consumption to find all 
the eigenvectors of a normalized random NxN Hermitian matrix un- 
der the fixed precision 768 [bits] [precision was doubled for "jacobi" 
(see the text)]. The average was taken over ten different matrices. 
The standard deviation is shown in parentheses in small fonts. Envi- 
ronments: Fedora 15 OS on Intel Core i5 M460 CPU with 4GB RAM 
for upper entries (♦) and Fedora 16 OS on AMD FX-8120 CPU with 
16GB RAM for lower entries (*). 



N 


ZKCM, 


PARI, eigen 


PARI, jacobi 




diag_H [sec] 


[sec] 


[sec] 


25 


♦3.41 (0.0152) 

*2.68 (o.oi30) 


0.941 (0.00406) 
0.841 (0.0399) 


5.99 (0.0699) 

4.79 (0.0219) 


50 


34.5 (0.0624) 


14.5 (0.0248) 


50.9 (0.364) 


27.1 (0.174) 


12.8 (0.0483) 


40.9 (o.ii4) 


75 


146 (0.238) 


74.3 (0.636) 


182 (0.977) 


114 (0.604) 


64.9 (0.345) 


144 (0.390) 


100 


413 (0.378) 


237 (i.24) 


477 (2.io) 


330 (4.14) 


206 (0.764) 


36 7 (5.43) 


125 


961 (i.io) 


596 (i.72) 


1070 ( 7.40) 


752 (4.73) 


508 (2.14) 


754 (4.62) 
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Table 5: Comparison of the average real time consumption for 
"eigenvalues.!!" and "diag_H" to find all the eigenvalues of a nor- 
malized random 100 X 100 Hermitian matrix. The average numerical 
error in the computed spectrum is also shown in the columns "Er- 
ror" . The average was taken over ten different matrices. In each cell, 
the standard deviation is shown in parentheses in small fonts, "prec" 
stands for the precision. Environments: Fedora 15 OS on Intel Core 
i5 M460 CPU with 4GB RAM for upper entries (♦) and Fedora 16 
OS on AMD FX-8120 CPU with 16GB RAM for lower entries (*). 





eigenvalues_H 


diag_H 


prec 


Time 

[sec] 


Error 




Time 
[sec] 


Error 






♦9.24 


1.70 x 10 


-32 


173 


7.24 x 10 


-76 


256 


(0.250) 

*7.26 


(2.98 x 10" 

1.19 x 10 


32 j 

-32 


(1.05) 

136 


(8.69 x 10" 

3.83 x 10 


76) 

-76 




(0.149) 


(2.00 x 10" 


32 j 


(1.13) 


(2.01 x 10" 


76) 




13.2 


1.32 x 10 


-49 


259 


3.86 x 10 


-153 


512 


(0.194) 

10.4 


(2.99 x 10" 

3.63 x 10 


49) 

-49 


(1.60) 

202 


(1.26 x 10" 

3.96 x 10 


153^ 

-153 




(0.142) 


(6.70 x 10" 


49) 


(1.88) 


(2.00 x 10" 


153^ 




19.1 


9.35 x 10 


-57 


413 


3.01 x 10 


-230 


768 


(0.368) 
15.2 


(8.62 x 10" 

3.71 x 10 


57) 

-56 


(2.61) 

327 


(1.12 x 10" 

4.88 x 10 


230) 

-230 




(0.272) 


(8.16 x 10" 


56) 


(2.82) 


(5.43 x 10" 


230 j 




27.6 


2.08 x 10 


-56 


637 


4.03 x 10 


-307 


1024 


(0.454) 

22.1 


(3.71 x 10" 

4.00 x 10 


56) 

-56 


(3.27) 

498 


(3.58 x 10" 

3.55 x 10 


307) 

-307 




(0.346) 


(7.49 x 10" 


56 \ 


(2.99) 


(3.01 X 10" 


307) 




35.9 


6.00 x 10 


-56 


904 


6.65 x 10 


-384 


1280 


(0.871) 

28.4 


(9.69 x 10" 

9.84 x 10 


56) 

-54 


(4.00) 

706 


(1.18 x 10" 

2.11 x 10 


383) 

-384 




(0.539) 


(3.10 x 10" 


53) 


(4.64) 


(6.12 x 10" 


385) 



unitary transformation of a diagonal matrix D with ran- 
dom real elements, where D was normalized by its Frobe- 
nius norm in advance. The numerical error in the com- 
puted spectrum {e'i} for "eigenvalues_H" is quantified 
by E({e'j}) = J2 i |ej — e'j| where a are the true eigenval- 
ues. Here, {e'i} and {e^} are sorted in the same order (we 
employed the descending order). Similarly, the numeri- 
cal error in the computed spectrum {e"i} for "diag_H" is 
quantified by E({e"i}) = J2i l e « — e "i\- For each cell of the 
tables, we performed the same simulation ten times with 
different random matrices and took the average for the 
time consumption or the numerical error. Note that this 
performance evaluation was separately performed with the 
evaluations shown in the previous tables. The matrix size 
N was fixed to 100 and several values of precision were 
tried for Table [5] the precision was fixed to 768 [bits] and 
several values of N were tried for Table [6] 

It has turned out that "eigenvalues_H" is faster than 
"diag_H" by one or two orders of magnitude to compute 
eigenvalues. The only difference in the internal struc- 
tures of these functions is the inverse iterations to find 
eigenvectors and at the same time enhance the accuracy 



Table 6: Comparison of "eigenvalues.!!" and "diag_H" in real time 
consumption to find all the eigenvalues of a normalized random N x 
N Hermitian matrix, for several different matrix sizes N for the 
fixed precision 768 [bits]. The average over ten trials is shown. The 
columns "Error" show the average numerical error in the computed 
spectrum. The standard deviation is shown in parentheses for each 
value. Environments: Fedora 15 OS on Intel Core i5 M460 CPU 
with 4GB RAM for upper entries (♦) and Fedora 16 OS on AMD 
FX-8120 CPU with 16GB RAM for lower entries (*). 





eigenvaluesjf 


diag_H 


N 


Time 

[sec] 


Error 


Time 

[sec] 


Error 






♦0.438 


4.46 x 10" 57 


3.34 


9.10 x 10 


-231 


25 


(0.0179) 

*0.343 


(8.47 x 10" 57 ) 

3.42 x 10~ 57 


(0.0253 

2.69 


(3.84 x 10" 

6.35 x 10 


231 ) 

-231 




(0.0153) 


(8.76 x 10" 57 ) 


(0.0302 


(1.72 x 10" 


231 ) 




2.75 


1.67 x 10" 5e 


34.0 


2.28 x 10 


-230 


50 


(0.0704) 

2.19 


(3.14 x 10" 56 ) 

9.67 x 10~ 57 


(0.220) 

27.2 


(2.51 x 10" 

2.08 x 10 


230 j 

-230 




(0.0824) 


(1.49 x 10" 56 ) 


(0.190) 


(1.02 x 10" 


230^ 




8.44 


7.30 x 10" 5e 


144 


3.16 x 10 


-230 


75 


(0.149) 

6.73 


(1.58 x 10" 55 ) 

2.64 x 10~ 56 


(0.762) 

114 


(3.58 x 10" 

2.40 x 10 


230) 

-230 




(0.163) 


(3.56 x 10 -56 ) 


(0.692) 


(1.03 x 10" 


230) 




19.1 


9.35 x 10~ b ' 


413 


3.01 x 10 


-230 


100 


(0.368) 

15.2 


(8.62 x 10" 57 ) 

3.71 x 10~ 56 


(2.61) 

327 


(1.12 x 10" 

4.88 x 10 


230) 

-230 




(0.272) 


(8.16 x 10" 56 ) 


(2.82) 


(5.43 x 10" 


230 j 




36.4 


1.22 x 10" 55 


950 


3.76 x 10 


-230 


125 


(0.794) 

28.4 


(2.28 x 10" 55 ) 

1.52 x 10~ 56 


(5.58) 

753 


(1.75 x 10" 

3.15 x 10 


230) 

-230 




(0.478) 


(2.08 x 10" 56 ) 


(5.25) 


(8.01 x 10" 


231 ) 
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of eigenvalues. Thus the result shows most of time con- 
sumption in "diag_H" is spent for inverse iterations. A 
drawback to use "eigenvalues_H" is the nonnegligible er- 
ror in the computed eigenvalues. For instance, we have 
< E({e'j}) >^ 10~ 49 when the precision is 512 [bits], i.e., 
when the machine epsilon is ~ 1CP 154 . Indeed, the error 
is much smaller than the machine epsilon for double pre- 
cision (~ 10~ 16 ), but is not in the acceptable range for 
multiprecision computation. 

So far, the ZKCM library has been introduced and eval- 
uated on its performance. An extension library for the 
study of quantum computation will be introduced in the 
next section. 



As for installation of the ZKCM_QC library, 
the standard process "./configure — > make — > 
sudo make install" should work; if not, the docu- 
ment should be consulted. 

We briefly overview the theory of the MPS simulation 
before introducing a program example. 

4-1. Brief overview of the theory of time- dependent MPS 
simulation 

Consider an n-qubit pure quantum state 



I*) 



l-.-i 

E c *°- 
.'1=0— 



_! 1*0 ' ' ' in-l) 



4. ZKCIVLQC library 

The ZKCM.QC library is an extension of the ZKCM 
library. It has several classes to handle tensor data use- 
ful for the (time-dependent) MPS simulation of a quan- 
tum circuit. Among the classes, the "mps" class and the 
"tensor2" class will be used by user-side programs. The 
former class conceals the complicated MPS simulation pro- 
cess and enables writing programs in a simple manner. 
The latter class is used to represent two-dimensional ten- 
sors which are often simply regarded as matrices. A quan- 
tum state during an MPS simulation can be obtained as 
a (reduced) density matrix in the type of "tensor2" . For 
convenience, there are functions to convert a matrix in 
the type of "tensor2" to the type of "zkcm_matrix" and 
vice versa. More details of the classes are explained in the 
document placed in the "doc" directory of the ZKCM_QC 
package. 

It might be curious to condensed-matter physicists why 
multiprecision computation is needed in an MPS simu- 
lation. Indeed, truncations of Schmidt coefficients (and 
corresponding Schmidt vectors) have been studied as a 
possible source of errors [33j while precision of basic 
floating-point operations has not been of main concern in 
physics simulations using the MPS data structure. Pre- 
cision shortage cannot be a main source of errors in light 
of truncation errors. However, it has been uncommo 
to use a truncation of nonzero Schmidt coefficients in 
time-de pen dent MPS simulations of quantum computing 
Under this condition, precision shortage be- 
comes the only crucial source of errors and it is hence of 
our main concern how much precision is required for an ac- 
curate simulation of quantum computing. We will discuss 
more on this issue in Sec. 14.31 In particular, we will show 
an example of simulating a quantum algorithm for which a 
truncation of at most a single nonzero Schmidt coefficient 
for each step results in a significant error; in addition, a 
precision slightly beyond the double precision is necessary 
for this example. 



with E in ...i 



1. If we keep this state as 



5 Here, we are considering the standard quantum circuit model 
for quantum computing. It is not the case in a Hamiltonian-bascd 
adiabatic-evolution model |34|. 



data as it is, updating the data for each time of unitary 
time evolution spends 0(2 2n ) floating-point operations. 
To avoid such an exhaustive calculation, in the matrix- 
product-state method, the data is stored as a kind of com- 
pressed data. The state is represented in the form 



I*) 



1 

£• 

io=0 



1 

E 

n-X=0 



mo — 1 mi — 1 



E 



vq—O v±— 



m n -2— 1 

• E 

V„-2=0 



Qo(io,vo)Vo(vo)Qi(ii,v ,vi)Vi(vi) ■ ■ ■ 

■■■V n -2(v n -2)Qn-l{in-l,Vn-2) I 



(i) 



where we use tensors {Q s }™=o with parameters i s , v s -i, v s 
(parameters w_i and v n —i exceptionally do not exist) and 
{V s }™Zq with parameter v s ; m s is a suitable number of 
values assigned to v s with which the state is represented 
precisely or well approximated. This form is one of the 
forms of matrix product states (MPSs). The data are kept 
in the tensors. We can see that neighboring tensors are 
correlated to each other; the data compression is owing to 
this structure. 

Let us explain a little more details: Q s {i s , v s -\, v s ) is a 
tensor with 2 x m s _i x m s elements; V s (v s ) is a tensor in 
which the Schmidt coefficients for the splitting between the 
sth site and the (s+l)th site (i.e., the positive square roots 
of nonzero eigenvalues of the reduced density operator of 
qubits 0, . . . , s) are stored. This implies that, by using 
V s and eigenvectors |3>°;" s ) (|3>* +1 '" n_1 )) of the reduced 
density operator p°— 8 (p s+1 — n_r ) of qubits 0, . . . , s (s + 
1, . . . , n — 1), the state can also be written in the form of 
Schmidt decomposition 



I*) = 



V s (v s )\*° v7 s W v 



s+l. 



(2) 



In an MPS simulation, very small coefficients and cor- 
responding eigenvectors can be truncated out unlike a 
usual Schmidt decomposition. It may happen that all the 
nonzero coefficients are nonnegligible. This is actually the 
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case in practice when we simulate quantum computing as 
we will discuss in Sec. 14.31 We can still enjoy a consid- 
erable data-size reduction since vanishing coefficients are 
truncated out. 

Besides the truncation, a significant advantage to use 
the MPS form is that we have only to handle a small num- 
ber of tensors when we simulate a time evolution under 
each quantum gate. For example, when we apply a uni- 
tary operation g U(4) acting on, say, qubits s and s + 1, 
we have only to update the tensors Q s (i s ,v s _i, v s ), V s (v s ), 
and <5s+i(*s+i, v a , v s+ i). For the details of how tensors 
are updated, see Refs. [l2|, • The simulation of a sin- 
gle quantum gate e U(4) spends 0(m^ lax ) floating-point 
operations where m max is the largest value of m s among 
the sites s. (Usually, unitary operations € U(2) and those 
€ U(4) are regarded as elementary quantum gates.) A 
quantum circuit constructed by using at most g single- 
qubit and/or two-qubit quantum gates can be simulated 
within the cost of 0(<?nm max max ) floating-point opera- 
tions, where n is the number of wires and m. m ax,max is 
the largest value of m max over all time steps. 

The computational complexity may be slightly differ- 
ent for each software using MPS. In the ZKCM.QC li- 
brary, we have functions to apply quantum gates € U(8) 
to three chosen qubits. Internally, three-qubit gates are 
handled as elementary gates. This makes the complex- 
ity a little larger. A simulation using the library spends 



max, max 



) floating-point operations, where g is the 



0(gr 

number of single-qubit, two-qubit, and/or three-qubit 
gates used for constructing a quantum circuit. As the pro- 
cess to update the tensors to simulate a three-qubit gate 
has not been written in detail in the literature as far as 
the author knows, it is explained in | Appendix A[ 

The MPS simulation process, which is in fact often com- 
plicated, can be concealed by the use of ZKCM.QC. One 
may write a program for quantum circuit simulation in 
an intuitive manner. Here is a very simple example. In 
the followings, we use version 0.1.0 of ZKCM_QC for our 
programs. 

4-2. Program example 

As a simple example, we simulate the time evolution 

|000)£ J=(|000) + |100)) 
cnot J_(|ooo) + |101)) 



where Hadamard operation H = ( j 



acts on 



qubit and CNOT = |00)(00| + |01)(01| + |10)(11| + |11)(10| 
acts on qubits and 2. Then we see the reduced density 
matrix of qubits and 2. 

The program example is shown in Listing [2] (A slightly 
extended sample code is placed in the "samples" directory 
of the ZKCM_QC package.) It utilizes several matrices 



declared in the namespace "tensor2tools" (see the doc- 
ument placed in the "doc" directory for the details of this 
namespace) . 



#include " zkcm_qc . hpp " 

int main (int argc , char *argv[]) 
{ 

//Use the 256-bit precision. 
zkcm_set_def ault_prec (256) ; 

//Num. of digits for each output is set to 8. 
zkcm_set_output_nd (8) ; 

//First , we make an MPS representing |000>. 
mps M (3) ; 

std : : cout << "The initial state is " 

<< std : : endl ; 
//Print the reduced density operator of the 
//block of qubits from to 2, namely, 0,1,2, 
//using the binary number representation for 
//basis vectors . 

std::cout << M . RD0_block (0 , 2) . str_dirac_b () 
<< std : : endl ; 

std :: cout << "Now we apply H to the 0th qubit." 

<< std : : endl ; 
M . applyU (tensor2tools::Hadamard, 0); 

std:: cout << "Now we apply CNOT to the \ 
qubit s and 2 . " 

<< std : : endl ; 
M . applyU (tensor2tools : : CNOT , 0, 2); 

//The array is used to specify qubits to compute 
//a reduced density matrix. It should be 
//terminated by the constant mps : : TA . 
int array [] = {0, 2, mps::TA}; 
std :: cout << "At this point, the reduced \ 
density matrix of the qubits and 2 is " 
<< std : : endl 

<< M.RD0C array) . str_dirac_b () 

<< std : : endl ; 
zkcm_quit ( ) ; 
return 0; 

} 

Listing 2: qc_simple_example.cpp 

The program is compiled and executed typically in the 
following way. 

[user@localhost foo]$ C++ -o qc_simple_example \ 
qc_simple_example . epp -lzkcm_qc -lzkcm -lmpfr -lgmp \ 
-lgmpxx 

[user@localhost foo]$ . /qc_simple_example 

The initial state is 

1 . 0000000e+00 1 000X000 1 

Now we apply H to the 0th qubit . 

Now we apply CNOT to the qubits and 2. 

At this point , the reduced density matrix of \ 

the qubits and 2 is 

5 . 0000000e-01 1 00X00 I +5 . 0000000e-01 1 00X11 1 \ 
+5 . 0000000e-01 1 11X00 1 +5 . 0000000e-01 1 11X11 1 

Besides this example, the package of ZKCM_QC has 
sample programs for simulating Grover's quantum search 
381 ] and simulating a simple projective measurement. In 
addition, we will see an example to simulate the quantum 
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search under the simplest setting in | Appendix A.T which 
is written as a part of the explanation of how to handle a 
three-qubit gate in an MPS simulation. 

4-3. Source of numerical errors in an MPS simulation of 
quantum computing 
It is a standard strategy in the MPS simulation and re- 
lated methods in computational condensed matter physics 
to impose a certain threshold m tra nc to the number of 
Schmidt coefficients; only larger mt ranc Schmidt coeffi- 
cients (and corresponding Schmidt vectors) are employed 
and the remaining are discarded at each time when ten- 



sors are updated [13j, [14| . It has been tacitly assumed that 
truncations are the main source of numerical errors and a 
numerical error due to precision shortage has not gathered 
attention. Venzl et al. [33j pointed out that hardness of a 
time-dependent MPS simulation depends on the distribu- 
tion of Schmidt coefficients. They reported certain param- 
eter choices in the tilted Bose-Hubbard model, for which 
the distribution has a rather long tail for larger Schmidt 
coefficients, i.e., negligibly small ones are not dominant. In 
such a case, mtrunc should be rather close to the maximum 
Schmidt rank (over all splittings between consecutive sites 
and over all time steps) in order to keep the truncation 
error small. In contrast, a usual time-dependent MPS sim- 
ulation for other nearest-neighbour coupling models stud- 
ied so far uses a relatively small value for mtrunc, typically 
fifty to one hundred [35| . With this much value of mtrunc , 
a time-dependent MPS simulation usually does not exhibit 
notable accumulation of numerical errors unless more than 
several hundred time steps of evolution is tried [3|| . A sig- 
nificant accumulation of truncation errors were reported 



for a larger number of time steps [35l [3 



As mentioned before, it has been uncommon to use trun- 
cations in (time-de pen dent) MPS simulations of quantum 
computing (l2l. [lil l27l]. This is reasonable because quan- 
tum algorithms involves many H and CNOT operations 
(see Sec. 14.21 for their definitions). This makes a quan- 
tum state evolving under a quantum circuit quite often 
a nearly equally biased superposition of several indices. 
By tracing-out some subsystem under this condition, we 
have a reduced density matrix whose nonzero eigenvalues 
are all nonnegligible. Thus it is not possible to truncate 
out even a single nonzero Schmidt coefficient, irrespective 
of the number of time steps. In addition, we have re- 
cently shown [37| that an accumulation of errors due to 
precision shortage is significant even without truncation of 
nonzero Schmidt coefficients as far as we have seen in an 
example to iterate the quantum search routine f38j . Thus 
high-precision computation has a practical advantage in 
an MPS simulation for a quantum circuit model with a 
large circuit depth. 

Here, we show a typical example where truncating out 
at most a single nonzero Schmidt coefficient for each time 
causes a significant error. In this example, slightly more 
than double precision is required for a stable simulation 
although the depth of the quantum circuit is not very large. 
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(b) 

Figure 3: (a) Quantum circuit of the Deutsch-Jozsa algorithm for 
the specified function (see the text), (b) Internal structure of gate 
g. Gate g~ is the reverse of g. 



We consider the Deutsch-Jozsa algorithm [39(. In a 
brief explanation, there is a promise that function / : 
{0, 1}' {0, 1} is either balanced (i.e., #{x|/(x) = 0} = 
#{x|/(x) = 1}) or constant (i.e., /(x) is same for all x) 
where x £ {On • • ■ 0;_i, . . . , lo • • ■ The task is to de- 

cide whether a given function / is balanced or constant. 
This takes 1 + 2 l ~ 1 queries for the worst case in classical 
computation while it takes only a single query in quantum 
computation with the Deutsch-Jozsa algorithm. The al- 
gorithm is described as follows: (i) Apply H® l VfH® 1 to / 
qubits initially in the state |0n • • ■ 0j_i) where Vf is an op- 
eration to put the factor (— to each |x); (ii) Measure 
the I qubits in the computational basis. When / is bal- 
anced, the probability of finding the qubits simultaneously 
in 0's in this measurement vanishes; when / is constant, it 
is exactly unity. For the details of the theoretical analysis 
of the algorithm, see, e.g., Sec. 3.1.2 of Ref. (Toj . 

Let us consider a particular function /(y ■ • • yjy — i) = 

©ilV 1 d(Yi) w ith g(x xix 2 x 3 ) = (x A xi) V (xi A x 2 ) V 
(x2 A x 3 ) where y i £ {0, l} 4 and xj £ {0, 1}; N g is a 
positive integer (here, symbol @ stands for the exclusive 
OR operation). Figure shows the quantum circuit of 
the algorithm for this function. This function is balanced 
for any value of N g > 1. In the figure, a single • and 
the connected © represent the CNOT operation with the 
control qubit specified by • and the target qubit specified 
by ffi. It flips the target qubit if the control qubit is in 
|1). Similarly, two »'s and © connected to each other rep- 
resent the CCNOT operation. It flips the target qubit if 
the two control qubits are in |11). (See also | Appendix A| 
for the implementation of three-qubit operations in our 
library.) By the structure of the circuit, each of the N g 
measurements should report Prob(0000) = if there is no 
numerical error. (This is easily proved: assuming different 
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values of Prob(OOOO) for two different bundles of qubits 
contradicts to the fact that the bundles are equivalent to 
each other by the circuit structure.) 

It is straight-forward to write a program code for the 
quantum circuit using the ZKCM.QC library. We begin 
with header file descriptions: 

#include <zkcm_qc . hpp> 
#include <sys/time .h> 

The second header file is needed by the following function 
to obtain the current time in seconds: 

double current_time_in_sec () 
{ 

timeval T; 

: :gettimeofday(&T, NULL); 
return ( (double) (T . tv_sec) 

+ 1.0e-6 * (double) (T.tv_usec)) ; 

} 

We use this function when we record the data of time 
consumption for Table [7J We may omit it otherwise. The 
main function begins with 

int main (int argc, char *argv[]) 

zkcm_set_def ault_prec (prec) ; 

with some precision "prec" and ends with 

zkcm_quit () ; 
return 0; 

} 

We describe the quantum circuit step by step in 
the middle part of the main function. (We call 
"current_time_in_sec ()" before and after this part to 
calculate the time consumption.) First of all, an object of 
the MPS data structure for n = 9N g + 2 qubits is created 

by 

mps M(n) ; 

M. set_m_trunc (mtrunc) ; 

where we also set the value of m trU nc by our option as in 
the second line. Then we write each gate operation one by 
one. For example, a Pauli X gate (or a bit flip) acting on 
the ith qubit is written as 

M.applyU(tensor2tools: :PauliX, i) ; 

Similarly, an Hadamard gate operation is written in the 
same manner with tensor2tools : : Hadamard. A CNOT 
gate with control qubit c and target qubit t is written as 

M.applyU(tensor2tools: :CN0T, c, t) ; 

A CCNOT operation with control qubits a, b, and target 
qubit t is written as 

M.CCNDTCa, b, t) ; 



After writing instructions corresponding to individual 
quantum gate operations, we need some code lines to ob- 
tain the probability of finding zeros in the output. Al- 
though we have a function to perform a projective mea- 
surement acting on a single qubit, this is not conve- 
nient. We instead compute the (0, 0) element of the re- 
duced density matrix of qubits of our interest. To obtain 
Prob(0o0i0203), for example, we write 

std::cout « "Prob(0_0 0_1 0_2 0_3)=" 

« M.RD0_block(0, 3).get(0, 0) 
« std: : endl ; 

In this way, one can easily write a program code for the 
quantum circuit. 

In the present context, it is also demanded to see the 
intrinsic data of the MPS step by step. To obtain the 
number m s of surviving Schmidt coefficients V s (v s ) for the 
splitting between sites s and s + 1, one may write, e.g., 

int num = M.get_m(s) ; 

for some integer < s < n — 2. In addition, each Schmidt 
coefficient is accessible by "M.V(s)(i)" which is a ref- 
erence to the zth element of V s . It is also convenient 
to use "M.get_m_maxO" (rn ma x = max s m s for the cur- 
rent time step) and "M.get_m_maxmaxO" (m max .max = 
max t max s m s where t stands for time) so as to find the 
time step where the Schmidt rank reaches the maximum. 
It should be noted that sometimes the maximum Schmidt 
rank appears during the internal process hidden behind 
each gate operation. This is because each gate operation 
for nonconsecutive qubits needs to internally move them to 
consecutive places before operation and move them back 
to their original places after operation owing to the one- 
dimensional data structure of MPS. This is done by swap- 
ping consecutive qubits step by step. For tracing the inter- 
nal time evolution of m s 's, one needs to mimic this process 
by using the operation "M. SWAP(a,b) ;" (this swaps spec- 
ified qubits a and b) manually. 

We have explained how the program code is written 
to simulate the quantum circuit of Fig. [31 Now we set 
N g to 7; we have 65 qubits in total in the circuit. It is 
certainly intractable to handle the circuit by the brute- 
force method because the dimension of the Hilbert space 
is 2 65 ~ 3.69 x 10 19 . It was numerically found, however, 
that the largest possible Schmidt rank of the MPS dur- 
ing evolution in the circuit is only 28 as shown in Fig. 0J 
Because of this reason, the MPS simulation of the algo- 
rithm took only approximately seven minutes (see Table 
[7]) when the precision was set to 256 [bits] and no trunca- 
tion of nonzero Schmidt coefficients was employed. Here, 
we used a computing server with the Red Hat Enterprise 
Linux 6 64-bit OS, Intel Xeon E7-8837 2.67GHz (2.80GHZ 
maximum) CPU, and 315GB RAM, instead of the common 
PCs that we normally usedH 



This was because a PC with 4GB RAM became unstable due to 
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Figure 4: Error in the computed value of Prob(0o0i0203) and 
m ma x max as functions of mtrunc- Precision was fixed to 256 bits. 



The main aim of the simulation is to investigate if 
any truncation of nonzero Schmidt coefficients is possi- 
ble. Figure [4] shows the error in the computed value of 
Prob(0o0i0203) (the discrepancy from zero) as a function 
of m trU nc- The maximum Schmidt rank m maXjmax observed 
during the simulation is also plotted in the figure as a func- 
tion of mtrunc- It is clearly shown that m tr unc should be 
equal to or more than the exactly maximum Schmidt rank 
that is observed in the absence of truncations; otherwise 
a significant numerical error appears. (In Ref. [37| . we 
have shown a similar result for a smaller circuit of the al- 
gorithm for a different balanced function.) As previously 
mentioned, the distribution of nonzero Schmidt coefficients 
is closely related to this phenomenon. We show the distri- 
bution in Fig. [5l which was taken at the point where the 
second CNOT gate was being processed among the 2N g + 1 
CNOT gates in the middle part of Fig. It clearly de- 

picts that none of the nonzero Schmidt coefficients was 
negligible. 

In addition, it was found that precision slightly more 
than the double precision was required to perform a stable 
simulation even when we did not impose any truncation 
of nonzero Schmidt coefficients, as shown in Table [7] For 
precision less than or equal to 55 bits, some of matrix 
diagonalization routines failed after the half point of the 
circuit, which indicates an accumulated error due to preci- 
sion shortage. More specifically speaking about this case, 
a small non-Hermitian fraction due to the accumulated er- 
ror in a reduced density matrix resulted in the convergence 
error of eigenvectors during an update of tensors. 
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Figure 5: Distribution of nonzero Schmidt coefficients at the point 
where the second CNOT gate was being processed among the 27V 9 + 1 
CNOT gates (N g was set to 7) in the middle part of Fig.^a). Note 
that we are employing the definition of Schmidt coefficients with 
square roots in this paper. Thus the sum of squared values of the 
Schmidt coefficients equals to one. 



Table 7: Precision (prec) dependence of the numerical error (Error) 
in the computed value of Prob(0o0i0203). No truncation of nonzero 
Schmidt coefficients was employed. The real time consumed for the 
simulation is also shown (Time), which is an average over 10 trials of 
the same simulation (the standard deviation is shown in the paren- 
theses). The simulated circuit had 65 qubits because N g was set to 
7. Environment: Red Hat Enterprise Linux 6 64-bit OS, Intel Xeon 
E7-8837 2.67GHz (2.80GHZ maximum) CPU, and 315GB RAM. 



memory shortage for this simulation for precision > 1024 bits. Each 
run of the simulation in the computing server consumed at most 
approximately 8GB memory space for precision 1024 bits and 1280 
bits. 



prec 


Error 


Time [sec] 


< 55 


Convergence failure in 
some of eigenvectors 


N/A 


56 


1.80 x 10" 32 


341 (io.4) 


57 


5.66 x 10~ 26 


335 (10.2) 


58 


1.72 x 10~ 32 


338 (7.82) 


59 


1.55 x 10~ 33 


379 (6.45) 


60 


1.92 x 10~ 34 


348 (13.3) 


256 


5.91 x 10" 139 


416 (8.50) 


512 


7.16 x 10~ 121 


709 (i9.o) 


768 


1.78 x 10~ 460 


922 (19.7) 


1024 


6.50 x 10~ 615 


1620 (12.3) 


1280 


7.77 x 10" 770 


2170 (16.9) 
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5. Discussion 

5.1. Discussion on the ZKCM library 

With the ZKCM library, standard tasks in numerical 
matrix calculations can be completed within the time con- 
sumption on the same order of magnitude as that with 
the PARI library as we have seen in the comparison per- 
formed in Sec. [3] The comparison also showed that ZKCM 
is slower than PARI, even in a simple matrix multiplica- 
tion, which suggests that basic arithmetic operations make 
a certain gap in speed of the two libraries. In fact, ZKCM 
uses MPFR functions while PARI uses GMP functions 
for the basic arithmetic computation. MPFR is approxi- 
mately two and a half times as slow as GMP in multiplying 
floating-point numbers as far as we testedQ Thus it is rea- 
sonable to find a certain gap in computational speed. We 
believe the gap is acceptable as long as it does not change 
the order of magnitude of computation time for basic ma- 
trix computation. It has been probably a historical reason 
that PARI has not used MPFR so farH We chose MPFR 
because it has floating-point exceptions and other useful 
functionalities for real floating-point numbers by default. 

The only unexpected result in Sec. [3] is the one about 
Hermitian matrix diagonalization shown in Tables [3] and 
[¥] It has been shown that time consumptions of functions 
"diag_H" of ZKCM and "eigen" and "jacobi" of PARI 
are on the same order of magnitude. This is an unusual 
result because the Jacobi method employed in "jacobi" 
looks as fast as a variant of the QR method employed in 
"diag_H" for a 100 x 100 matrix. 

It is likely that "diag_H" and "jacobi" are used for 
a similar purpose or under a similar situation, since they 
both work fine for a matrix with degenerate eigenvalues. 
(It is known [29[ that "eigen" almost always fails for a 
matrix with degenerate eigenvalues.) In this point of view, 
too, it is meaningful to compare "diag_H" and "jacobi". 

As mentioned above, function "jacobi" of PARI is an 
implementation of a standard Jacobi method. Function 
"diag_H" of ZKCM employs the Householder-QR method 
with the Wilkinson shift for computing approximate eigen- 
values and uses several sets of inverse iterations to find 
eigenvectors and at the same time to enhance the accu- 
racy of eigenvalues. It has been commonly known that the 
Jacobi method is slower than the QR method: "For matri- 
ces of order greater than about 10, however, the algorithm 



7 We wrote a test program to compute x < — x * y with y = 
1.00 — 1.00 X 10~ 7 for 10 s times where the initial value of x was 
set to 1. The precision was set to 512 bits. It took 15.3 seconds 
on average (with standard deviation 1.09 X 10 _1 ) when we used 
MPFR's "mpf r_t" structure for floating point numbers while it took 
only 6.11 seconds on average (with standard deviation 4.91 X 10 -2 ) 
when we used GMP's "mpf _t" structure. The average was taken over 
ten trials. This test was performed on a machine with the Fedora 15 
64-bit OS, Intel Core i5 M460 CPU, and 4GB RAM. GMP version 
4.3.2 and MPFR version 3.0.0 were used. 

8 PARI has been developed since 1979 [24|]. GMP appeared in 
1991 0| and MPFR appeared in 2000 0- PARI already had plenty 
of fast floating-point functionalities at the time MPFR appeared. 



is slower, by a significant constant factor, than the QR 
method..." — page 571 of Ref. [3(j; "...the Jacobi method 
is several times slower than a reduction to tridiagonal form, 
followed by Francis's algorithm." — page 488 of Ref. (ipj . 
Thus, it is unusual that, for a relatively large matrix with 
the order 100, there is no significant advantage in using 
the Houscholdcr-QR method. In a phenomenological ex- 
planation, this is due to the large cost of inverse iterations. 
By using the Gprof program [4l| (a monitoring software), 
it turned out that, in "diag_H", more than 71.1 percent of 
running time was spent for the inverse iterations in case of 
a 100 x 100 matrix with the precision > 512 [bits]. In fact 
speedup of one or two orders of magnitude was possible 
by omitting inverse iterations for finding eigenvalues as we 
have shown in Tables [5] and |5] It was however impossible 
to achieve a required precision without inverse iterations 
as clearly shown in the tables. 

Thus, it was unexpectedly expensive to achieve a suffi- 
cient convergence by inverse iterations. We, of course, used 
the LU decomposition to reduce the cost of each inverse 
iteration. This, however, did not mitigate the total cost of 
the inverse iterations sufficiently. As a matter of fact, the 
total number of the inverse iterations had to be increased 
along with the increase in required precision. This, at a 
glance, does not look in accordance with a conventional 
theory of inverse iteration 42[ suggesting that the ma- 
chine epsilon does not make a difference in the process of 
inverse iteration. The conventional theory, however, con- 
siders the process where the inverse iteration is used for 
computing an eigenvector for a given approximate eigen- 
value. In our case, in contrast, we also improve accuracy of 
the eigenvalue using the computed eigenvector. A routine 
of inverse iteration is often called several times in order for 
achieving sufficient accuracy for each pair of an eigenvalue 
and the corresponding eigenvector. Furthermore, program 
routines called by the routine of inverse iteration are not 
error-free in practice. Thus it was not surprising that the 
choice of precision considerably affected the cost of inverse 
iterations (and hence the time consumption of "diag_H" 
shown in Table [5]). 

Another factor that makes inverse iterations expensive 
is the cost of basic arithmetic operations in multipreci- 
sion computing. Unlike double-precision computing where 
every basic arithmetic instruction is performed within a 
few clock cycles, it takes quite many cycles!! It is fast in 
double-precision computing to successively perform arith- 
metic instructions because this does not involve any con- 
ditional branching. In contrast, it is inevitable to involve 
several conditional branching instructions in every arith- 
metic operation in multiprecision computing. Thus, the 
simple fact that basic arithmetic instructions are not hard- 
ware instructions should be a large factor. 



9 Theoretically, each instruction with precision prec takes 
0(prec c ) time with c < 2 dependent on the chosen algorithm. In 
real CPU time, it should scale slightly better in practice because of 
vectorization of operands. See the manual of GMP [8§ for the details 
of algorithms for arithmetic instructions. 
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In future, we should find a strategy for faster Hermi- 
tian matrix diagonalization. As a possible approach, the 
precision of computation for finding approximate eigen- 
values can be internally enlarged so that a relatively small 
number of subsequent inverse iterations should be enough 
to achieve a required precision. Table [5j however, shows 
that accuracy of approximate eigenvalues is not very much 
enhanced by simply increasing the preset precision in the 
Householder-QR method. So far, the author could not 
find the reason of this tendency. A detailed theoreti- 
cal and practical analyses of the present implementation 
(and perhaps the method itself) will be required to over- 
come this difficulty. For another approach, there is a 
possibility that a certain method uncommon under the 
double-precision environment possesses an advantage un- 
der a high-precision environment in a practical sense. It 
is of interest to re-investigate practical usefulness of con- 
ventional methods [43| for eigenvalue problems under a 
high-precision environment. 

Apart from computational speed, the syntax of a library 
is also important for usability. As a C++ library, ZKCM 
provides an easy-to-use syntax for handling matrices by 
operator overloading. It also provides various functionali- 
ties for matrix computation as member functions of a class 
as well as as external functions. Thus a program code 
using the library should look simpler than those written 
with some other multiprecision libraries for Fortran or C 
languages. In general, it is relatively easy to write an ex- 
tension library of a C++ library. As we have introduced, 
ZKCM has an extension library named ZKCM_QC devel- 
oped for a (time-dependent) MPS simulation of quantum 
circuits. The main library and the extension are under 
steady development. Latest development versions can be 
downloaded from the repositories linked from the URL Q ■ 

5.2. Discussion on the ZKCM_QC library 

As for the ZKCM_QC library, it is worth mentioning 
that multiprecision computation is useful in MPS simu- 
lations of quantum computing as discussed in Sec. 14.31 
Indeed, truncation errors are dominant whenever trunca- 
tions of nonzero Schmidt coefficients are employed. It is 
however uncommon to employ such truncations in simulat- 
ing quantum computing since they cause an unacceptably 
large error as we have discussed. As a consequence, the 
rounding error become the only possible error. In our ex- 
ample, slightly more than double precision was required 
for reliable MPS simulation. This is true in a different 
example to simulate quantum search, which is shown in 
our related contribution [37(. Thus, time-dependent MPS 
simulation is fragile not only against accumulating trunca- 
but also against accumulating rounding 
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tion errors 

errors. High-precision computation is therefore beneficial 
to the MPS method. 

Apart from the MPS method, there have been several 
simulators of quantum computing which make use of par- 
allel programming techniques 4J, |45j, |46j, |47| to mitigate 
the time consumption of the brute-force method. They 



use a parallelized exact matrix computation and spend a 
massive computational resource {e.g., more than one thou- 
sand CPU processes and several hundred gigabytes physi- 
cal memory to handle less than forty qubits [45j]). It seems 
that an accumulation of rounding errors is not significant 
for these simulators although they use double precision 
computation, as this was not reported so far. In contrast 
to their approach, use of the MPS method is quite eco- 
nomical. As we have shown in Sec. 14.31 we could handle 
65 qubits in the MPS simulation of the Deutsch-Jozsa al- 
gorithm under a certain setup, which ran as a single CPU 
process^ and took only (approximately) seven minutes in 
case of 256-bit floating-point precision. 

Besides the MPS method, Viamontes et al.'s method 
Hi 5(| 57 1 using a compressed graph representation is also 
quite economical to simulate quantum computation. Ac- 
cording to Refs. 16, 57j], its compressed data structure is 
sensitive to rounding errors so that they used multipreci- 
sion computation based on the GMP library. The MPS 
data structure |T]) is of course a compressed data struc- 
ture for which we needed multiprecision computation for 
stable quantum circuit simulation. It will be interesting to 
theoretically investigate if a simulation of quantum com- 
puting utilizing a compressed data structure generally has 
a certain inevitable sensitivity to rounding errors. 

Finally, we discuss on the computational cost of the 
MPS method, which is known to grow polynomially in 
the maximum Schmidt rank m maxjnal during the simu- 
lation [13] as mentioned in Sec. 14.11 It is expected that 
MPS simulation becomes very expensive for quantum cir- 
cuits of algorithms for ha rd p roblems like those for quan- 
tum prime factorization 48[ of a large composite num- 
ber. It has been discussed [49| that large entanglement 
(this usually leads to a large Schmidt rank) must be in- 
volved in quantum prime factorization. So far, Kawaguchi 
et al. [Bp] ] found Schmidt rank 92 in a modular exponenti- 
ation circuit (this is used in quantum prime factorization) 
with 35 qubits in their MPS simulation. Nevertheless, 
it has been neither proved nor numerically verified that 
wi maXimax grows exponentially in the number n of qubits as 
far as the author knows. Although there have been several 
studies on entanglement during quantum prime factoriza- 
tion 51, 52, 5^, 54 1, they have not reached an answer to 
how entanglement grows in n. It is still an open problem 
how we rigorously estimate the value of ra maXjmax for a 
given quantum circuit. Presently, a known upper bound 
for m maXimax is a function exponentially growing in the 
number of basic quantum gates overlapping to each other 
(i.e., those stretching across the same bundle of wires) 
[551 ] : this is however not practically useful for a user of 
the MPS method to estimate the value of m maXimax for 
his/her simulation. More theoretical and numerical efforts 
are required to understand the scalability of the method. 



10 Every simulation with ZKCM or ZKCM.QC shown in this paper 
was run as a single CPU process. 
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6. Summary 

We have introduced the ZKCM library which is a C++ 
library developed for multiprecision complex-number ma- 
trix computation. It is especially usable for high-precision 
numerical simulations in quantum physics; it has an easy- 
to-use syntax for matrix manipulations and helpful func- 
tionalities like the tensor-product and tracing-out opera- 
tions, the discrete Fourier transform, etc. An extension 
library ZKCM_QC has also been introduced, which is a li- 
brary for a multiprecision time-dependent matrix-product- 
state simulation of quantum computing. It enables a user- 
friendly coding for simulating quantum circuits. 

Appendix A. Updating tensors of an MPS for 
simulating a three-qubit gate oper- 
ation 

The library ZKCM_QC uses three-qubit gates as ele- 
mentary gates in addition to single- and two-qubit gates. 
As it is not usually explained in detail how a three-qubit 
gate is simulated in a time-dependent MPS simulation, a 
detailed explanation is given here. As for simulations of 
single- and two-qubit gates, see Refs. 12|, |27j for detailed 
explanations. 

Consider the quantum gate 
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acting on the three qubits I + 1, and 1 + 2. With a focus 
on the three qubits, the MPS of n qubits can be written 
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with Itfc-'- 1 ) = ^-xfa-i)!*"- 1 - 1 ) and h^ 3 -"- 1 } = 
^+ 2 (^ +2 )|^+ 3 --™- 1 ), where l^;:';'" 1 ) are the Schmidt 
vectors of the block of qubits 0, . . . , I — 1 for the splitting 
between l — l and I, and I'&i^j'"'"" 1 ) are those of the block 
of qubits 1 + 3, . . . , n— 1 for the splitting between I + 2 and 

; + 3. 

What we should do as a simulation of applying the quan- 
tum gate U to Y&) is to update tensors Qi, VJ, Qi+i, Vj+i, 
and Qi +2 to Qi, V u Qi+i, Vj+i, and Qi+2, respectively so 
that the MPS with the updated tensors represents the re- 
sultant state |\&). This process is explained hereafter step 
by step. (Here is some note on the description: In case 
l — l = —1, the tensor V;_i should be regarded as unity 
and the parameter vi—± should be dropped. Similarly, in 



case I + 2 = n — 1, the tensor Vj+2 should be regarded as 
unity and the parameter vi+2 should be dropped.) 

The state after U is applied to the qubits I, I + 1, and 
I + 2 can be written as 

l*)=EE E ©(Mi+ij^+Zf^-ijVJ+a) 

vi-i vi+2 hii+iii+2 (A.l) 

x l«i°!:r i ~ 1 >l^>l*i+i}|*i+2>|^+2'""™~ :l ) 



with the tensor 

Q(ii,ii +1 ,ii + 2,vi- 1 ,vi +2 ) = ^2Y1 E 

Vi v t+ i kiki + 1 ki +2 
X Qi +1 (kl +1 ,Vl,Vl +1 )Vl + l(vi +1 )Ql +2 (k l+ 2,Vl +1 ,Vl + 2) 



First, we are going to compute the tensors Qiiii, vi) 
and Vi(vi) of the resultant state. The reduced density 
matrix of qubits 0, I calculated from Eq. (| A. 1 11 is 



A" - E 

ijJJj-li'l^'i-i 



E [^+2(«i+a)] S 



= E E M+afa+2)] a 

%lVi_ii' l-l il + lil+2^1+2 

x @(ii,ii+i,ii+2, u/+2)9* (ii,H+i,ii+2,v' i-i,v l+2 ) 



x ^i(t)n)^iKn) 



l^,_ 1 )|ii><^ l _ 1 K* / i| 



with 1 



= E i(+1 i !+2 „ !+2 •••_)• The matrix p '-'' 
is a (2to/_i) x (2mi—i) matrix. This is now diagonalized to 
achieve the eigenvalues A^, = [Vj 2 and the correspond- 
ing eigenvectors " : ') under the basis 

Immediately we find the values of Vi(yi). We may trun- 
cate out negligibly small eigenvalues 11 ! to reduce mi (the 
updated value of m; , namely the number of data in Vi ) . 
In addition, we have vector elements Ci(ii, Vi-i, vi) of just 
computed eigenvectors: 

|5S;-- , ) = C7 I (» I) t; I _ 1) i; I )|*S;r 1 ' , - 1 >|ii), 
from which we can derive 

Qi{ii,vi-i,vi) = Ci(ii,vi-i,vi)/Vi-i(vi-i). 



11 We should not employ a threshold for the number of eigenvalues 
as we have discussed in Sec. 1 1.31 
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Thus we have obtained 
to,...J\ 



into the above equation to obtain 



l$;r" ) = Ol(*l»»l-i>«l)l«l-i>l*l>- 

Second, we are going to compute the tensors Vi+i{vi+x) 

and Qi+2(h+2,vi + i,vi + 2) from Eq. (|A.1[) . The reduced 
density matrix of qubits I + 2, . . . , n — 1 is 



J+2,...,ra-l 



E 



E M-ifa-O] 5 



x 0(zj, ii +2 , Vi-i, Vi +2 )Q* (ii,ii + i,i'i + 2,vi-i,v'i +2 ) 

X \i l+2 )\v l+2 )(i'l +2 \{v' l+2 \ 

E [ E W-i(^-0] 2 

il+2l>!+2i'l+2l>'l+2 M( + l<->(-l 

x 9(ij,«/+i, ii+2, uj-i, ^+2)9* (ij, ij+i, i'i+2) u'/+2) 



x V/ +2 (^+2)^+2 (f';+2) 



l*i+2)|*«, +3 )(* / i+2|(*«', +a | 



E ^i + 2«l + 2i'l + 2«'i + 2 Nl+2) |$tl l+ a) (*V2 I (^Vj+2 I 
ii+2fi+2i'!+2f'!+2 



with 6 



„/+2,...,n-l 



»( + 2«( + 2l 1+2" 1 + 2 



The matrix 



is a (2to/+2) x (2m;+2) matrix. This is now di- 
agonalized to achieve the eigenvalues A„ !+1 = [V/+i(u/+i)] 2 
and the corresponding eigenvectors |$|+ 2 ^— > n_1 ) under the 

basis {1^+2)1^^+2)}- The values of V/+i (u/+i) are immedi- 
ately found. We may truncate out negligibly small eigen- 
values to reduce mj+i (the updated value of m;+i). In 
addition, we have vector elements Ci+2(ij+2, V1+2) of 
just computed eigenvectors: 

|$H-2,-,n-l) = C l+2 {i l+2l V l+U V l+2 )\il + 2)\^r n ~ 1 )^ 

from which we can derive 

Qi+2(ii+2,vi + i,vi +2 ) = Ci +2 (ii +2 ,vi + i,vi +2 )/Vi +2 (vi +2 ). 
Thus we have obtained 

|^i+2,...,n-l) = Q^ 2 (i l+2iVl+ljVl+2 )\i l+2 )\ Vl+2 ). 

Third, we are going to compute the tensor 
<3i+i(ii+i,Ui, i>z+i). By the definition of this tensor, 
we have 



li+i{k+i,vi,vi +1 ) 



(«2i"- , l<<j+ilTO-' n - 1 l*) 



Vi(vi)Vi+i(vi + i) 

^(«l)^+l("tfl)i,i 1+a . l _ 1 « l+a l v 

x (<5t+^-'"- 1 |ii+ 2 )|«i+2)) 9(ii,ii+i,ii+2,Ui_i,Ui+2) 
We substitute the equations 



I* 



o,...,i 



) = <5i(ii,Wi-i,«i)|ui-i)|ii) 



| $ i+2,...,n-l) = Q l+2 (i l+2jVl+uVl+2 )\i l+2 )\ Vl+2 ) 



h+i(ii+i,vi,vi+i) 



1 



E 



Vi(vi)V l+1 (vi +1 ) im+2Vl _ lVl+2 
[Vi-i(vi-i)] 2 Qi (ii,vi--L,vi)Qi + 2 (ii +2 ,vi +1 ,vi +2 ) 
x [VJ+ 2 (wz+2)] 2 9(ii, ii+i,iz+2,Ui-i, ^+2) 



In this way, we have obtained the updated tensors 

Qi(ii,vi-i,vi), Vi(vi), Qi + i(ii +1 ,vi,vi+i), 
and Qi +2 {ii+2, Ui+i, w/ +2 ). 

The described process to simulate a three-qubit gate is 
implemented as the function "applyU8" in ZKCM_QC. We 
will see an example to use the function in the following. 

Appendix A.l. Example 

Here is an example program to use a three-qubit oper- 
ation. It simulates the simplest case of Grover's quantum 
search (38j . One may also see the simulation performed in 
Sec. 14.31 for another example. 

For a brief sketch of the quantum search, suppose that 
there is an oracle function / : {0,1}" — > {0,1} that has r 
solutions w (namely, strings w such that /(w) = 1). Clas- 
sical search for finding a solution takes 0(2 n /r) queries. In 
contrast, the Grover search finds a solution in 0(y/2 n /r) 
queries. In particular when n = 4 and r = 1, a single query 
is enough for the Grover search. For example, consider the 
parent data set {00,01,10,11} and the solution 01 for a 
certain oracle with a single oracle bit. A single Grover 
iteration R maps \s) = §(|00) + |01) + 1 10) + |ll))|-) 
to |01)|-) o , where R = U s U f with Uf = 1 - 2|01)(01| 
and U s = 1 — 2|s)(s| acts on the left side qubits; |— ) Q = 
(| 0)o — |l)o)/v / 2 is a state of the oracle qubit (the right- 
most qubit). Uf is a unitary operation corresponding to 
/ and U s is a so-called inversion-about-average operation. 
To implement Uf and U s , we utilize the fact that flipping 
|— ) changes the phase factor ±1, following a common im- 
plementation [111 ] . 

Here, we consider a very simple oracle structure, namely 
that, it flips |— ) when the left side qubits are in 1 01) using 
a very straight-forward circuit interpretation. Note that 
there is no realistic benefit to perform a search in such a 
case because we know the solution beforehand. Usually a 
search is performed because we cannot guess the solutions 
by the circuit appearance (consider, e.g, an instance of a 
satisfiability problem [H^] ) . The sample program to simu- 
late the Grover search for the present simple case is shown 
in Listing [21 



#include " zkcm_qc . hpp " 

int main ( ) 

{ 

zkcm_ se t _def aul t _pr e c (256) ; 
mps M (3) ; 
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//Prepare the state for the parent data set and 

//the initial state of the oracle qubit . 

M. applyU (tensor2tools : : Hadamard , 0) ; 

M. applyU(tensor2tools : : Hadamard , 1) ; 

M . applyU (tensor2tools:: PauliX , 2) ; 

M. applyU (tensor2tools : : Hadamard , 2) ; 

//Oracle function with the solution 01. 

zkcm_matrix U_f (8, 8); 

U_f . set_to_identity () ; 

U_f (2 , 2) = U_f (3 , 3) =0; 

U_f (2 , 3) = U_f (3 , 2) = 1 ; 

//Inversion- about -average operation 
zkcm_matrix U_s(8, 8); 
U_s . set_to_identity () ; 
U_s(0, 0) = U_s(l, 1) = 0; 
U_s(0, 1) = U_s(l, 0) = 1; 

zkcm_matrix H = zkcm_matr ix ( " [1 , 1; 1, -1]") 

/ sqrt ( zkcm_ clas s ( 2) ) ; 
zkcm_matrix I = zkcm_matr ix ( " [1 , 0; 0, 1]"); 
H = tensorprod ( tensorpr od (H , H) , I); 
U_s = H * U_s * H;//This is l-2|s><s|. 

std::cout << "Initially, Prob(01)=" 

<< M . RD0_block (0 , l).get(l, 1) 
<< std : : endl 

<< "Go into Grover ' s iteration..." 

<< std : : endl ; 
for (int i = 0; i < 8; i+ + ) 
{ 

M . applyU8 (U_f , , 1 , 2) ; 

M . applyU8 (U_s , 0, 1, 2); 

std::cout << "After " << i + 1 

<< " times iteration, Prob(01)=" 
<< M . RD0_block (0 , l).get(l, 1) 
<< std : : endl ; 

} 

zkcm_quit ( ) ; 
return 0; 

Listing 3: simple_q_search.cpp 

The program is compiled and executed in the following 
way. 

[user@localhost foo]$ C++ -o simple_q_search \ 
simple_q_search. cpp -lzkcm_qc -lzkcm -lmpfr \ 
-lgmp -lgmpxx 

[user@localhost foo]$ . /simple_q_search 
Initially , Prob (01) =2 . 500000000e-01 
Go into Grover's iteration... 

After 1 times iteration, Prob(01)=l . 000000000e+00 
After 2 times iteration, Prob(01)=2 . 500000000e-01 
After 3 times iteration, Prob(01)=2 . 500000000e-01 
After 4 times iteration, Prob(01)=l . 000000000e+00 
After 5 times iteration, Prob(01)=2 . 500000000e-01 
After 6 times iteration, Prob(01)=2 . 500000000e-01 
After 7 times iteration, Prob(01)=l . 000000000e+00 
After 8 times iteration, Prob(01)=2 . 500000000e-01 

We see that the success probability, i.e., the probability to 
find 01 in the left side qubits if we measure them in the 
computational basis, is unity after a single R is applied. By 
continuing the iteration, the success probability oscillates. 

We have seen a sample program that uses a three-qubit 
gate by calling the function "applyU8" . In addition to this 



function, there are functions for typical three-qubit gates. 
They are briefly mentioned below. 

Appendix A. 2. Typical three-qubit gates 

The first typical three-qubit gate is the CCNOT 
gate already introduced in Sec. 14.31 which flips a sin- 
gle target qubit t under the condition that two con- 
trol qubits a and b are in |11). ZKCM_QC has 
a particular member function (of class "mps") for it: 
"mps & mps::CCN0T (int a, int b, int t);". The 
second typical one is the controlled-SWAP (CSWAP) 
gate, which swaps two target qubits p and q un- 
der the condition that the control qubit c is in |1). 
The member function corresponding to this gate is 
"mps & mps::CSWAP (int c, int p, int q);". 

In summary of this appendix, the process to simulate a 
three-qubit gate operation in a time-dependent MPS simu- 
lation has been theoretically described. A sample program 
to simulate a simple quantum search has been provided, 
which uses one of the three-qubit gate functions of the 
ZKCIvLQC library. 
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